Abstract

The compact Genetic Algorithm (cGA) is an Estimation of Distribution Algorithm that
generates the offspring population according to the estimated probabilistic model of the
parent population instead of using traditional recombination and mutation operators.
The cGA only needs a small amount of memory; therefore, it may be quite useful in
memory-constrained applications. This paper introduces a theoretical framework for
studying the cGA from the convergence point of view, in which we model the cGA by a
Markov process and approximate its behavior using an Ordinary Differential Equation
(ODE). Then, we prove that the corresponding ODE converges to local optima and
stays there. Consequently, we conclude that the cGA will converge to the local optima
of the function to be optimized.
1 Introduction

One of the most famous optimization procedures for combinatorial optimization is
the Genetic Algorithm (GA). By maintaining a population of solutions, the GA can
be viewed as implicitly modeling the solutions seen in the search process. In
the standard GA, new solutions are generated by applying randomized recombination
operators to two or more high-quality individuals of the current population
(Goldberg, 1989). These recombination operators, such as one-point, two-point or
uniform crossover, randomly select non-overlapping subsets of two "parent" solutions
to form "children" solutions. By using a crossover operator that preserves groups
of parameters from parents to children, the GA attempts to capture dependencies
between the parameters implicitly.

The poor behavior of genetic algorithms in some problems, sometimes attributed
to designed operators, has led to the development of other types of algorithms. The
Probabilistic Model Building Genetic Algorithms (PMBGAs) or Estimation of Distribution
Algorithms (EDAs) are a class of algorithms which has been developed recently to
preserve the building blocks (Larrañaga and Lozano, 2001). The principal concept in
this new technique is to prevent the disruption of partial solutions contained in a
solution by building a probabilistic model. The EDAs are classified into three classes based
on the interdependencies between the variables of solutions (Larrañaga and Lozano,
2001) (Pelikan et al., 1999): the no dependencies model, the bivariate dependencies
model, and the multiple dependencies model. To name just a few, instances of EDAs
include the Population-based Incremental Learning (PBIL) (Baluja, 1994)
(Baluja and Caruana, 1995), the Bit-based Simulated Crossover (BSC) (Syswerda, 1992),
the Univariate Marginal Distribution Algorithm (UMDA) (Mühlenbein, 1998), the
compact Genetic Algorithm (cGA) (Harik et al., 1999) and the Learning Automata
based Estimation of Distribution Algorithm (LAEDA) (Rastegar and Meybodi, 2005a)
for the no dependencies model; Mutual Information Maximization for Input Clustering
(MIMIC) (De Bonet et al., 1996) and Combining Optimizers with Mutual Information
Trees (COMIT) (Baluja and Davies, 1997) for the bivariate dependencies model; and
finally, the Factorized Distribution Algorithm (FDA) (Mühlenbein and Mahnig, 1999b)
and the Bayesian Optimization Algorithm (BOA) (Pelikan et al., 1999) for the multiple
dependencies model.



Some researchers have studied the working mechanism of EDAs. The behavior
of the UMDA and the PBIL has been studied in (Mühlenbein, 1998) (Gonzalez et al.,
2000) (Höhfeld and Rudolph, 1997) (Zhang, 2004a) (Rastegar and Meybodi, 2005b)
(leil and Rowe, 2005). Both (Zhang, 2004b) and (Mühlenbein and Mahnig, 1999a)
discuss the convergence of the FDA for separable additively decomposable functions.
In (Zhang and Mühlenbein, 2004), Zhang and Mühlenbein have proven that a class of
EDAs with an infinite population size globally converges. Rastegar and Meybodi
(2005c) carried out a study on the time complexity of EDAs with an infinite population
size.



Although all algorithms for the no dependencies model have low efficiency in
solving difficult problems, it is still important to study them because of their simplicity in
terms of memory usage and computational complexity, and because
the computational complexity of the bivariate dependencies model and the multiple
dependencies model is high. One of the simplest algorithms of the no dependencies
model is the compact Genetic Algorithm (cGA). This algorithm initializes a Probability
Vector (PV), where each component of the PV follows a Bernoulli distribution with
the parameter of 0.5, and then two solutions are randomly generated by using this
PV. The generated solutions are ranked based on their fitness values. Then, the PV
is updated based on these solutions. This process of adaptation continues until the
PV converges. The cGA represents the population as a PV over a set of solutions and
operationally mimics the order-one behavior of the Simple GA (SGA) with the uniform
crossover. When confronted with easy problems (e.g., continuous-unimodal problems
involving lower order building blocks (BBs)), the cGA achieves the performance of the SGA (with
the uniform crossover) in terms of the number of fitness evaluations (Harik et al., 1999).



By now, some variations of the cGA have been introduced and some of them have
been utilized in real-world applications (Ahn and Ramakrishna, 2003) (Gallagher et al.,
2004) (Baraglia et al., 2001a) (Sastry and Goldberg, 2000) (Baraglia et al., 2001b), but
the behavior of the cGA has not been studied in detail. To the best knowledge of the
authors, the only papers in which an analytical study of the cGA has been done are
(Ahn and Ramakrishna, 2003) and (Droste, 2005). In (Ahn and Ramakrishna, 2003), it
has been proven that the elitism-based cGA is equal to the Evolution Strategy (ES). In
(Droste, 2005), Droste has presented the first rigorous runtime analysis of the cGA for
linear pseudo-boolean functions. He has shown that not all linear functions have the
same asymptotical runtime. In this paper, we study the cGA as a recursive stochastic
algorithm. We model the cGA as a Markov process and approximate it by an ODE
when its learning step is small. Then, we study the behavior of the obtained ODE and
determine its convergence and stability properties.

Parameters: a is the learning step, n is the solution length
Step 1. Set k to 0, and initialize the probability vector
    For i := 1 to n do p_i(k) := 0.5;
Step 2. Generate two solutions from the probability vector
    a(k) := generate(p(k)); b(k) := generate(p(k));
Step 3. Let them compete
    w(k), l(k) := compete(a(k), b(k));
    where w(k) and l(k) are the winner and loser solutions respectively
    (if both a(k) and b(k) have the same fitness value then a(k) is selected as w(k))
Step 4. Update the probability vector
    For i := 1 to n do
        If w_i(k) ≠ l_i(k) then
            If w_i(k) == 1 then p_i(k + 1) := p_i(k) + a;
            Else p_i(k + 1) := p_i(k) - a;
Step 5. Check if the probability vector has converged.
    Go to Step 2, if it is not satisfied.

Figure 1: Pseudocode of the cGA

This work is organized as follows. Section 2 describes the cGA precisely. In Section
3, a formulation of the cGA and some required definitions and lemmas are stated. In
Section 4, the analysis of the cGA as a Markov process is done in two stages. In the first
stage, we derive an ODE whose solution approximates the asymptotic behavior of the
cGA. Then, in the second stage, we prove that the corresponding ODE, and therefore
the cGA, surely converges to the local optima of the function to be optimized and stays
at them. Finally, Section 5 concludes the paper.

2 The Compact Genetic Algorithm 

At each iteration $k$, the cGA manages its population as a PV, $p(k) = (p_1(k), \dots, p_n(k))$,
where $n$ is the number of genes; thereby it mimics the order-one behavior of the SGA
with the uniform crossover (Harik et al., 1999). The value of $p_i(k) \in [0, 1]$, $i = 1, \dots, n$,
measures the proportion of the allele "1" in the $i$th locus of the simulated population.
Figure 1 describes the pseudocode of the cGA.

For $i = 1, \dots, n$, $p_i(0)$ is initialized with 0.5 to represent a randomly generated
population. In each generation (i.e. iteration), two competing solutions are generated
on the basis of the current PV and then the PV is updated to favor the better solution
(i.e. the winner). The probability $p_i(k)$ is increased (decreased) by the learning step, $a$,
when the $i$th locus of the winner has an allele of "1" (resp. "0") and the $i$th locus of
the loser has an allele of "0" (resp. "1"). If both the winner and the loser have the
same allele in the $i$th locus, then the probability $p_i(k)$ remains the same. This scheme is
equivalent to the (steady-state) pair-wise tournament selection (Harik et al., 1999). The
cGA terminates when all the probabilities converge to zero or one.
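The procedure described in this section can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the fitness function `binval`, the function name `cga`, the iteration cap `max_iters` and the choice to store the PV as integer counts (so that steps of size a = 1/(2N) reach 0 and 1 exactly) are assumptions made here for the example.

```python
import random

def binval(y):
    """Hypothetical injective fitness: the integer encoded by the bit list y."""
    value = 0
    for bit in y:
        value = 2 * value + bit
    return value

def cga(g, n, N, rng, max_iters=200_000):
    """Run the cGA with learning step a = 1/(2N).

    The PV is kept as integer counts c[i] in {0, ..., 2N}, so that
    p_i = c[i] / (2N) moves in exact steps of a. Returns (p, iterations).
    """
    counts = [N] * n                      # Step 1: p_i(0) = 0.5
    for k in range(max_iters):
        # Step 5: stop once every p_i is exactly 0 or 1.
        if all(c in (0, 2 * N) for c in counts):
            return [c / (2 * N) for c in counts], k
        # Step 2: sample two solutions from the PV.
        a = [1 if rng.random() < c / (2 * N) else 0 for c in counts]
        b = [1 if rng.random() < c / (2 * N) else 0 for c in counts]
        # Step 3: compete; on a tie, a is selected as the winner.
        winner, loser = (a, b) if g(a) >= g(b) else (b, a)
        # Step 4: shift p_i by +a or -a wherever winner and loser differ.
        for i in range(n):
            if winner[i] != loser[i]:
                counts[i] += 1 if winner[i] == 1 else -1
    return [c / (2 * N) for c in counts], max_iters

if __name__ == "__main__":
    p, iters = cga(binval, n=6, N=25, rng=random.Random(0))
    print("final PV:", p, "after", iters, "iterations")
```

The returned PV is a corner of the unit hypercube; which corner is reached depends on the random seed and on N.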



3 Problem Formulation 

Let $y = (y_1, \dots, y_n)$ denote a solution where $y_i$ belongs to $\{0, 1\}$ and consider that
$g : \Omega \to \mathbb{R}$ is an injective pseudo-boolean function to be maximized, where $\Omega = \{0, 1\}^n$.
The goal is to maximize $g$ using the cGA. At the $k$th iteration of the optimization process,
two solutions $w(k)$ and $l(k)$ are generated on the basis of $p(k)$ where $g(w(k)) \ge
g(l(k))$, and then, the PV is updated as follows,

$$p_i(0) = 0.5, \quad 1 \le i \le n$$

$$p(k + 1) = p(k) + a\,(w(k) - l(k)) \tag{1}$$

To prevent the $p_i$s from getting smaller than 0 or larger than 1, we let $a$ be equal
to $1/(2N)$, where $N$ is a positive integer number. In the remainder of this section,
we introduce our definitions and derive some results that will be used later for the
analysis of the cGA.

Definition 1. A solution $y$ is called a local maximum of the function $g$, if and only if,
for each solution $z$ whose Hamming distance to the solution $y$ is one, i.e. $d_H(y, z) = 1$,
we have $g(y) \ge g(z)$. A local maximum is called strict if the inequality is strict.
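For small n, Definition 1 can be checked by brute force. The sketch below is illustrative only: the helper names and the lookup table `TABLE` (an injective function on three bits chosen here so that it has two local maxima) are assumptions, not part of the paper.

```python
from itertools import product

def hamming_neighbors(y):
    """All solutions at Hamming distance one from the tuple y."""
    return [y[:i] + (1 - y[i],) + y[i + 1:] for i in range(len(y))]

def local_maxima(g, n, strict=False):
    """Enumerate the local maxima of g over {0,1}^n as in Definition 1."""
    result = []
    for y in product((0, 1), repeat=n):
        if strict:
            ok = all(g(y) > g(z) for z in hamming_neighbors(y))
        else:
            ok = all(g(y) >= g(z) for z in hamming_neighbors(y))
        if ok:
            result.append(y)
    return result

# Hypothetical injective example on n = 3 bits with two local maxima.
TABLE = {
    (0, 0, 0): 6, (1, 1, 1): 7,
    (0, 0, 1): 1, (0, 1, 0): 2, (1, 0, 0): 3,
    (0, 1, 1): 4, (1, 0, 1): 5, (1, 1, 0): 0,
}

if __name__ == "__main__":
    print(local_maxima(TABLE.get, 3))
```

Because the example function is injective, every local maximum it has is also strict.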

Definition 2. The configuration space of the cGA is $K = [0, 1]^n$ where $p(k) \in K$ for
each $k$. Also $K^* = \{0, 1\}^n$ ($K^*$ is equivalent to $\Omega$) is called the corner (the deterministic
subspace) of $K$ and $K - K^*$ is called the non-deterministic subspace of $K$.

Definition 3. $p(k)$ is called a deterministic configuration if $p_i(k) = 0$ or $1$ for every
$i = 1, \dots, n$, i.e. $p(k) \in K^*$; in the other cases, $p(k)$ is called a non-deterministic
configuration, i.e. $p(k) \in K - K^*$.

Lemma 1. Let $\Pr(w(k) = y \mid p(k))$ be the probability of obtaining $y$ as the winner solution
of the $k$th iteration. Then

$$\Pr(w(k) = y \mid p(k)) = \Pr(y \mid p(k)) \Big\{ \sum_{g(z) < g(y)} \Pr(z \mid p(k)) + \sum_{g(z) \le g(y)} \Pr(z \mid p(k)) \Big\} \tag{2}$$

where $\Pr(y \mid p(k))$ denotes the probability of sampling the solution $y$.

Proof. At each iteration, two solutions are sampled from $p(k)$. The probability that
the first sampled solution is equal to $y$ and is the winner solution is
$\Pr(y \mid p(k)) \Pr(\text{all } z \mid p(k), g(y) \ge g(z))$, and the probability that the second sampled solution is the
winner solution and is equal to $y$ is $\Pr(\text{all } z \mid p(k), g(z) < g(y)) \Pr(y \mid p(k))$. Therefore,
$\Pr(w(k) = y \mid p(k))$ is equal to the sum of these probabilities. Hence the proof. □

Lemma 2. Let $\Pr(l(k) = y \mid p(k))$ be the probability of obtaining $y$ as the loser solution
of the $k$th iteration. Then

$$\Pr(l(k) = y \mid p(k)) = \Pr(y \mid p(k)) \Big\{ \sum_{g(z) > g(y)} \Pr(z \mid p(k)) + \sum_{g(z) \ge g(y)} \Pr(z \mid p(k)) \Big\} \tag{3}$$

where $\Pr(y \mid p(k))$ denotes the probability of sampling the solution $y$. The proof is
similar to the proof of Lemma 1.
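For a small n, the winner and loser probabilities of Lemmas 1 and 2 can be compared against a direct enumeration of all ordered pairs of samples. The sketch below assumes the tie rule of Figure 1 (the first sample wins a tie); the function names and the example fitness `binval` are hypothetical and chosen only for illustration.

```python
from itertools import product

def binval(y):
    """Hypothetical injective fitness: the integer encoded by the bits of y."""
    value = 0
    for bit in y:
        value = 2 * value + bit
    return value

def sample_prob(y, p):
    """Pr(y | p): probability of sampling solution y from the PV p."""
    prob = 1.0
    for y_i, p_i in zip(y, p):
        prob *= p_i if y_i == 1 else 1.0 - p_i
    return prob

def winner_prob(y, p, g):
    """Lemma 1: probability that y is the winner."""
    space = list(product((0, 1), repeat=len(p)))
    strictly_worse = sum(sample_prob(z, p) for z in space if g(z) < g(y))
    worse_or_equal = sum(sample_prob(z, p) for z in space if g(z) <= g(y))
    return sample_prob(y, p) * (strictly_worse + worse_or_equal)

def loser_prob(y, p, g):
    """Lemma 2: probability that y is the loser."""
    space = list(product((0, 1), repeat=len(p)))
    strictly_better = sum(sample_prob(z, p) for z in space if g(z) > g(y))
    better_or_equal = sum(sample_prob(z, p) for z in space if g(z) >= g(y))
    return sample_prob(y, p) * (strictly_better + better_or_equal)

def enumerate_pairs(p, g):
    """Winner and loser distributions from all ordered pairs (a, b)."""
    space = list(product((0, 1), repeat=len(p)))
    win = dict.fromkeys(space, 0.0)
    lose = dict.fromkeys(space, 0.0)
    for a in space:
        for b in space:
            pr = sample_prob(a, p) * sample_prob(b, p)
            winner, loser = (a, b) if g(a) >= g(b) else (b, a)
            win[winner] += pr
            lose[loser] += pr
    return win, lose
```

Both distributions sum to one, which gives a quick sanity check on the two formulas.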

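Lemma 3. Assume that $p_m$ and $y_m$ are the $m$th positions of $p$ and $y$ respectively. Then
equations (4)-(9) are true for $\Pr(y \mid p)$,

$$\Pr(y \mid p) = \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i} \tag{4}$$

$$\Pr(y \mid p) = \begin{cases} 1 & \text{if } p = y \\ 0 & \text{if } p \ne y \end{cases} \quad \text{for } p \in K^* \tag{5}$$

$$\frac{\partial \Pr(y \mid p)}{\partial p_m}\Big|_{p = y} = \begin{cases} 1 & \text{if } y_m = 1 \\ -1 & \text{if } y_m = 0 \end{cases} \tag{6}$$

$$\frac{\partial \Pr(z \mid p)}{\partial p_m}\Big|_{p = y} = 0 \quad \text{for all } z \text{ whose } d_H(z, y) \ge 2 \tag{7}$$

$$\frac{\partial \Pr(z \mid p)}{\partial p_m}\Big|_{p = y} = \begin{cases} 1 & \text{if } d_H(z, y) = 1 \text{ and } z_m = 1,\ y_m = 0 \\ -1 & \text{if } d_H(z, y) = 1 \text{ and } z_m = 0,\ y_m = 1 \end{cases} \tag{8}$$

$$\frac{\partial \Pr(z \mid p)}{\partial p_m}\Big|_{p = y} = 0 \quad \text{if } d_H(z, y) = 1 \text{ and } y_m = z_m \tag{9}$$

where $y \in \Omega$ and $\Pr(y \mid p)$ denotes the probability of sampling the solution $y$.

Proof. Equation (4) is trivial by the fact that all $y_i$s are independent. The other results
can be easily obtained by (4) (Gonzalez et al., 2000). □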
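The derivative identities of Lemma 3 can be checked numerically at a corner. The sketch below is illustrative only and rests on one observation: since the sampling probability is linear in each component of the PV, a central difference gives the exact partial derivative. The function names and the step `h` are assumptions made for this example.

```python
def sample_prob(y, p):
    """Pr(y | p) as a product over positions (multilinear in p)."""
    prob = 1.0
    for y_i, p_i in zip(y, p):
        prob *= p_i if y_i == 1 else 1.0 - p_i
    return prob

def partial(z, p, m, h=0.5):
    """Central difference of Pr(z | p) with respect to p_m.

    Pr(z | p) is linear in p_m, so the result is exact up to rounding,
    even though p_m +/- h may leave the interval [0, 1].
    """
    up = list(p)
    up[m] += h
    down = list(p)
    down[m] -= h
    return (sample_prob(z, up) - sample_prob(z, down)) / (2 * h)

if __name__ == "__main__":
    y = (1, 0, 1)                 # corner at which derivatives are evaluated
    p = [float(b) for b in y]
    print([partial(y, p, m) for m in range(3)])          # signs follow y_m
    print([partial((0, 0, 1), p, m) for m in range(3)])  # neighbor at m = 0
    print([partial((0, 1, 1), p, m) for m in range(3)])  # distance two
```

The three printed rows correspond to the derivative of the corner itself, of a solution at Hamming distance one, and of a solution at Hamming distance two.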



4 Analysis of the Compact Genetic Algorithm 

Under the algorithm specified by (1), $\{p(k), k \ge 0\}$ is a Markov process. The analysis of
this process is done in two stages. In the first stage, we derive an ODE whose solutions
approximate the asymptotic behavior of $p(k)$ for a sufficiently small learning step $a$ (i.e.
$N$ tends to infinity) used in (1). In the second stage, we characterize the solutions of the
ODE and thus, we obtain the long-term behavior of $p(k)$.
The algorithm given by (1) can be represented as

$$p(k + 1) = p(k) + a\,G(p(k), w(k), l(k)), \quad \text{where } G(p(k), w(k), l(k)) = w(k) - l(k) \tag{10}$$

$w(k)$ and $l(k)$ denote the winner and the loser solutions respectively and $a$ is the
learning step. Now, define

$$\Delta p(k) = E\{p(k + 1) \mid p(k)\} - p(k) \tag{11}$$

where $E\{\cdot\}$ is the mathematical expectation operator. Since $\{p(k); k \ge 0\}$ is
Markovian and $w(k)$ and $l(k)$ only depend on $p(k)$, not on $k$, $\Delta p(k)$ can be given
as follows.

$$\Delta p(k) = a\,f(p(k)) \tag{12}$$

where $f : K \to \mathbb{R}^n$ and

$$f(p) = E\{G(p(k), w(k), l(k)) \mid p(k) = p\} = E\{w(k) - l(k) \mid p(k) = p\} \tag{13}$$

The function $f(p)$ can be rewritten as follows,

$$\begin{aligned}
f(p) &= E\{w(k) \mid p(k)\} - E\{l(k) \mid p(k)\} \\
&= \sum_{y} y \Pr(w(k) = y \mid p(k)) - \sum_{y} y \Pr(l(k) = y \mid p(k)) \\
&= \sum_{y} y \big( \Pr(w(k) = y \mid p(k)) - \Pr(l(k) = y \mid p(k)) \big)
\end{aligned} \tag{14}$$

By Lemma 1 and Lemma 2, and some simplification, we have

$$f(p) = 2 \sum_{y} y \Pr(y \mid p(k)) \Big\{ \sum_{g(z) < g(y)} \Pr(z \mid p(k)) - \sum_{g(z) > g(y)} \Pr(z \mid p(k)) \Big\} \tag{15}$$
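For small n, the closed form (15) can be evaluated by enumeration and compared with the expected one-step change E{w(k) - l(k) | p(k) = p} computed directly over all ordered pairs of samples. The sketch below is illustrative only; the names `drift` and `expected_step` and the example fitness `binval` are assumptions made here.

```python
from itertools import product

def binval(y):
    """Hypothetical injective fitness: the integer encoded by the bits of y."""
    value = 0
    for bit in y:
        value = 2 * value + bit
    return value

def sample_prob(y, p):
    """Pr(y | p): probability of sampling solution y from the PV p."""
    prob = 1.0
    for y_i, p_i in zip(y, p):
        prob *= p_i if y_i == 1 else 1.0 - p_i
    return prob

def drift(p, g):
    """f(p) from (15), by enumeration of {0,1}^n."""
    n = len(p)
    space = list(product((0, 1), repeat=n))
    probs = {y: sample_prob(y, p) for y in space}
    f = [0.0] * n
    for y in space:
        worse = sum(probs[z] for z in space if g(z) < g(y))
        better = sum(probs[z] for z in space if g(z) > g(y))
        weight = 2.0 * probs[y] * (worse - better)
        for i in range(n):
            f[i] += y[i] * weight
    return f

def expected_step(p, g):
    """E{w - l | p} from all ordered pairs (a, b); ties go to a."""
    n = len(p)
    space = list(product((0, 1), repeat=n))
    e = [0.0] * n
    for a in space:
        for b in space:
            pr = sample_prob(a, p) * sample_prob(b, p)
            winner, loser = (a, b) if g(a) >= g(b) else (b, a)
            for i in range(n):
                e[i] += pr * (winner[i] - loser[i])
    return e
```

With this example fitness, the drift at the initial PV (0.5, 0.5, 0.5) is positive in every component, and it vanishes at every corner of the hypercube.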

Now, define a sequence of continuous-time interpolations of (10) denoted by $p^a(t)$
and called an interpolated process, whose components are defined by

$$p_i^a(t) = p_i(k), \quad 1 \le i \le n, \quad t \in [ka, (k + 1)a) \tag{16}$$

The interpolated process $\{p^a(t)\}_{t \ge 0}$ is a sequence of random variables that takes
values in $D^n$, the space of all right continuous functions with left hand limits defined
over $[0, \infty)$, and $p^a$ takes values in $K$, which is a bounded subset of $\mathbb{R}^n$. The objective
is to study the limit behavior of the sequence $\{p^a(t)\}_{t \ge 0}$ as $a$ (resp. $N$) tends to zero
(resp. infinity), which will be a good approximation of the asymptotic behavior of (16).
When $a$ tends to zero, (12) can be written as the following ODE

$$\frac{dp}{dt} = f(p) \tag{17}$$

We are interested in characterizing the long-term behavior of $p(k)$ and hence the
asymptotic behavior of the ODE (17). Now, we show that the sequence of interpolated
processes $\{p^a(\cdot)\}$ weakly converges to the solution of the ODE (17) with the initial
configuration $p(0)$. This implies that the asymptotic behavior of $p(k)$ can be obtained from
the solution of the ODE (17).
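The behavior of the ODE dp/dt = f(p) can be explored numerically for small n. The sketch below integrates it with the forward Euler method from the initial configuration (0.5, ..., 0.5). The integrator, its step size and number of steps, the clipping to [0, 1] and the linear example fitness `binval` are all assumptions made for illustration; they are not part of the analysis in the paper.

```python
from itertools import product

def binval(y):
    """Hypothetical injective fitness: the integer encoded by the bits of y."""
    value = 0
    for bit in y:
        value = 2 * value + bit
    return value

def sample_prob(y, p):
    """Pr(y | p): probability of sampling solution y from the PV p."""
    prob = 1.0
    for y_i, p_i in zip(y, p):
        prob *= p_i if y_i == 1 else 1.0 - p_i
    return prob

def drift(p, g):
    """Right-hand side f(p) of the ODE, computed by enumeration of {0,1}^n."""
    n = len(p)
    space = list(product((0, 1), repeat=n))
    probs = {y: sample_prob(y, p) for y in space}
    f = [0.0] * n
    for y in space:
        worse = sum(probs[z] for z in space if g(z) < g(y))
        better = sum(probs[z] for z in space if g(z) > g(y))
        weight = 2.0 * probs[y] * (worse - better)
        for i in range(n):
            f[i] += y[i] * weight
    return f

def integrate_ode(p0, g, step=0.05, steps=1000):
    """Forward Euler integration of dp/dt = f(p), clipped to [0, 1]^n."""
    p = list(p0)
    for _ in range(steps):
        f = drift(p, g)
        p = [min(1.0, max(0.0, p_i + step * f_i)) for p_i, f_i in zip(p, f)]
    return p

if __name__ == "__main__":
    print(integrate_ode([0.5, 0.5, 0.5], binval))
```

For this linear example the trajectory approaches the all-ones corner, which is the only local maximum of the example function.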

Theorem 1. Consider the sequence of interpolated processes $\{p^a(t)\}$. Let $X_0 = p^a(0) =
p(0)$. When $a \to 0$, the sequence weakly converges to $X(\cdot)$, which is the solution of the
ODE,

$$\frac{dX}{dt} = f(X), \quad X(0) = X_0 \tag{18}$$

Proof. The theorem is a particular case of a general weak convergence theorem
(Kushner, 1984, Theorem 3.2). We note the following about the cGA given by (1),

1. $\{p(k), (w(k - 1), l(k - 1)), k \ge 1\}$ is a Markov process.

2. $(w(k), l(k))$ takes values in a compact metric space.

3. The function $G(\cdot, \cdot, \cdot)$ is bounded, continuous, and independent of $a$.

4. For a specific configuration, $p(k) = p$, $\{(w(k), l(k)), k \ge 0\}$ is an independent
identically distributed (i.i.d.) sequence. Let $M^p$ be the distribution of the process
$\{(w(k), l(k)), k \ge 0\}$.

5. The ODE (18) has a unique solution for each initial condition.

Hence, by using the weak convergence theorem (Kushner, 1984, Theorem 3.2),
when $a \to 0$, the sequence $\{p^a(\cdot)\}$ weakly converges to the solution of the ODE

$$\frac{dX}{dt} = \bar{G}(X), \quad X(0) = X_0$$

where $\bar{G}(p) = E_p\, G(p(k), w(k), l(k))$ and $E_p$ denotes the expectation with respect
to the invariant measure $M^p$. Since for $p(k) = p$, $(w(k), l(k))$ is an i.i.d. sequence whose
distribution only depends on $p$ and the function $g$, we have

$$\bar{G}(p) = E\{G(p(k), w(k), l(k)) \mid p(k) = p\} = f(p) \tag{19}$$

Hence the theorem. □

Theorem 1 enables us to understand the long-term behavior of $p(k)$. The weak
convergence in this theorem implies that when $a$ tends to zero, the trajectory of $p^a(t)$
will closely follow the solution of the ODE with a high probability on any finite interval.
As the length of the time interval increases and $a$ tends to zero, the trajectory of the ODE
spends most of the time required by the optimization process in a small neighborhood
of $p^0$, the configuration to which the solution of the ODE converges. Thus, $p^a(\cdot)$ will
eventually (with a high probability) spend all of its time in a small neighborhood of $p^0$
as well. As $a$ tends to zero, the cGA follows the trajectory of the ODE in a time interval
which tends to infinity. The above point is summarized in the following lemma.

Lemma 4. For a large $k$ and a small enough value of $a$, the asymptotic behavior of $p(k)$
can be approximated by the solution of the ODE (18) with the same initial configuration.

Proof. Let $X(\cdot)$ be the solution of the ODE (18) with the initial condition $X(0) = X_0$,
which is sufficiently close to an asymptotically stable configuration of the ODE, say
$p^0 \in K$. For any $Y(t) \in K$, $t \ge 0$, and any positive $T < \infty$, define

$$h_T(Y) = \sup_{0 \le t \le T} \|Y(t) - X(t)\| \tag{20}$$

The function $h_T(\cdot)$ is continuous on $K$. Theorem 1 states that $E\{h_T(p^a)\}$ tends to
$E\{h_T(X)\} = 0$ as $a \to 0$; the limit is zero since the value of $h_T(X)$ on the trajectories
of the limit process is zero with probability one. Thus, the sup of the distance between the
original sequence $p^a(t)$ and $X(t)$ goes to zero in probability as $k$ tends to infinity. With
the particular initial condition used, let $p^0$ be the stationary configuration to which the
solution of the ODE converges. Using this and the nature of the interpolation given in
(16), it is implied that for the given initial configuration, any $\epsilon > 0$, and the integers
$K_1$, $K_2$, $0 < K_1 < K_2 < \infty$, there exists an $a_0$ such that

$$\mathrm{Prob}\Big\{ \sup_{K_1 \le k \le K_2} \|p(k) - p^0\| > \epsilon \Big\} = 0, \quad \forall a < a_0 \tag{21}$$

Thus, if the ODE has an asymptotically stable configuration $p^0$, then for all initial
conditions which are sufficiently close to $p^0$, the cGA essentially converges to $p^0$. □

In the rest of the analysis, we consider the stability properties of the ODE, speaking
in terms of the stability or instability of configurations in $K$, and finally,
we study the convergence of the cGA.

4.1 Stationary Configurations and the Stability Property 

The following theorem characterizes the solutions of the ODE and hence, states the 
long-term behavior of the cGA. 

Theorem 2. If the learning step is sufficiently small, the following is true for the cGA. 

1. All deterministic configurations are stationary configurations. 

2. All non-deterministic configurations are non-stationary configurations. 

3. All local maxima of $g$ are asymptotically stable and the other points of $\Omega$ are
unstable.

Proof. 

1. By inspection of (15), if $p$ is a deterministic configuration, i.e. $p \in K^*$, then by
Lemma 3 (5), $f(p) = 0$; therefore, $p$ is a stationary configuration.

2. Assume that $S = \{y \mid y \in \Omega, \Pr(y \mid p) > 0\}$. It is clear that if $p$ is a non-deterministic
configuration, then $N_S$, the cardinality of $S$, is an even number greater than one.

$$\begin{aligned}
\sum_{i=1}^{n} \Big|\frac{dp_i}{dt}\Big| = \sum_{i=1}^{n} |f_i(p)| &= \sum_{i=1}^{n} \Big( 2 \sum_{y} y_i \Pr(y \mid p) \Big| \sum_{g(z) < g(y)} \Pr(z \mid p) - \sum_{g(z) > g(y)} \Pr(z \mid p) \Big| \Big) \\
&= 2 \sum_{y \in S} \Big\{ \Pr(y \mid p) \Big| \sum_{g(z) < g(y)} \Pr(z \mid p) - \sum_{g(z) > g(y)} \Pr(z \mid p) \Big| \Big\} \Big( \sum_{i=1}^{n} y_i \Big)
\end{aligned} \tag{22}$$

$g$ is an injective function so we have

$$S = \{y^i \mid y^i \in \Omega \text{ and } 1 \le i \le N_S\} \tag{23}$$

where $N_S > 1$ and $\forall i < j$, $g(y^i) > g(y^j)$. According to (23), (22) can be rewritten as

$$\sum_{i=1}^{n} \Big|\frac{dp_i}{dt}\Big| = 2 \sum_{k=1}^{N_S} \Pr(y^k \mid p) \Big( \sum_{i=1}^{n} y_i^k \Big) \Big| \sum_{j=1}^{k-1} \Pr(y^j \mid p) - \sum_{j=k+1}^{N_S} \Pr(y^j \mid p) \Big| \tag{24}$$

By inspection of (24) and taking into account that $N_S > 1$,

$$\sum_{i=1}^{n} \Big|\frac{dp_i}{dt}\Big| > 0$$

therefore, there is at least one $i$ where

$$\frac{dp_i}{dt} \ne 0$$

and consequently, $p$ is not a stationary configuration.

Note that if the function $g$ is not injective, the result may be invalid and we cannot
make sure that all non-deterministic configurations are non-stationary configurations.



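3. To prove this part of the theorem, we apply Lyapunov's indirect method (Drazin,
1992) to $f(p^0)$ where $p^0 \in K^*$ ($p^0$ can be considered as a binary string that belongs
to $\Omega$). At first, we compute the Jacobian matrix of $f(\cdot)$ at $p^0$,

$$\begin{aligned}
\frac{\partial f_i(p)}{\partial p_m}\Big|_{p^0} &= 2 \big\{ T^1_{i,m}(p^0) + T^2_{i,m}(p^0) \big\} \\
T^1_{i,m}(p^0) &= \sum_{y} y_i \frac{\partial \Pr(y \mid p)}{\partial p_m}\Big|_{p^0} \Big( \sum_{g(z) < g(y)} \Pr(z \mid p^0) - \sum_{g(z) > g(y)} \Pr(z \mid p^0) \Big) \\
T^2_{i,m}(p^0) &= \sum_{y} y_i \Pr(y \mid p^0) \Big( \sum_{g(z) < g(y)} \frac{\partial \Pr(z \mid p)}{\partial p_m}\Big|_{p^0} - \sum_{g(z) > g(y)} \frac{\partial \Pr(z \mid p)}{\partial p_m}\Big|_{p^0} \Big)
\end{aligned} \tag{25}$$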



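We split $\Omega$ into three subspaces: $\Omega_1 = \{p^0\}$, $\Omega_2 = \{y \mid d_H(p^0, y) = 1\}$, and $\Omega_3 =
\{y \mid d_H(p^0, y) \ge 2\}$. Assume that

$$w \in \Omega_2, \quad w_m \ne p^0_m \quad \text{and} \quad \forall i \ne m,\ w_i = p^0_i \tag{26}$$

By Lemma 3 and some simplification, the two parts of (25) can be rewritten as

$$T^1_{i,m}(p^0) = w_i \frac{\partial \Pr(w \mid p)}{\partial p_m}\Big|_{p^0} \big\{ I(g(p^0) < g(w)) - I(g(p^0) > g(w)) \big\} \tag{27}$$

and

$$T^2_{i,m}(p^0) = p^0_i \frac{\partial \Pr(w \mid p)}{\partial p_m}\Big|_{p^0} \big\{ I(g(w) < g(p^0)) - I(g(w) > g(p^0)) \big\} \tag{28}$$

by (25), (27) and (28),

$$\frac{\partial f_i(p)}{\partial p_m}\Big|_{p^0} = 2 \frac{\partial \Pr(w \mid p)}{\partial p_m}\Big|_{p^0} (w_i - p^0_i) \big\{ I(g(p^0) < g(w)) - I(g(p^0) > g(w)) \big\} \tag{29}$$

where the value of $I(exp)$ is one when $exp$ is true and it is zero when $exp$ is false.
If $i \ne m$, by (26) we have

$$w_i = p^0_i \quad \text{or} \quad w_i - p^0_i = 0 \tag{30}$$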
therefore,

$$\frac{\partial f_i(p)}{\partial p_m}\Big|_{p^0} = 0 \tag{31}$$

For $i = m$, we investigate two cases: 1) $p^0$ is a local maximum; 2) $p^0$ is not a local
maximum. If $p^0$ is a local maximum, i.e. for each $y \in \Omega_2$, $g(p^0) > g(y)$, then by (29)

$$\frac{\partial f_m(p)}{\partial p_m}\Big|_{p^0} = -2 < 0 \tag{32}$$

and by (31) and (32),

$$J(f(p^0)) = \mathrm{Diag}\{-2, \dots, -2\} \tag{33}$$

Thus, all eigenvalues of $J(f(p^0))$ are $-2$ and by Lyapunov's indirect method, $p^0$ is
an asymptotically stable stationary configuration.

If $p^0$ is not a local maximum, then there is at least one $v \in \Omega_2$ and there exists an
index $q$ where

$$g(v) > g(p^0), \quad v_q \ne p^0_q \tag{34}$$

In this case, for $i = m = q$, by (29) we write

$$\frac{\partial f_q(p)}{\partial p_q}\Big|_{p^0} = 2 > 0 \tag{35}$$

By (31) and (35), we conclude that $J(f(p^0))$ is a diagonal matrix, where at least one
of its eigenvalues is positive and by Lyapunov's indirect method, $p^0$ is an unstable
stationary configuration. □
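The Jacobian structure used in part 3 of the proof can be checked numerically on a small example. The sketch below is illustrative only: the table `TABLE` is a hypothetical injective function on three bits with local maxima at 000 and 111, and the Jacobian is approximated by central differences, which are exact up to rounding here because each component of f is a polynomial of degree at most two in each component of the PV.

```python
from itertools import product

# Hypothetical injective fitness on 3 bits; local maxima at 000 and 111.
TABLE = {
    (0, 0, 0): 6, (1, 1, 1): 7,
    (0, 0, 1): 1, (0, 1, 0): 2, (1, 0, 0): 3,
    (0, 1, 1): 4, (1, 0, 1): 5, (1, 1, 0): 0,
}

def sample_prob(y, p):
    """Pr(y | p) as a product over positions."""
    prob = 1.0
    for y_i, p_i in zip(y, p):
        prob *= p_i if y_i == 1 else 1.0 - p_i
    return prob

def drift(p, g):
    """f(p) computed by enumeration of {0,1}^n."""
    n = len(p)
    space = list(product((0, 1), repeat=n))
    probs = {y: sample_prob(y, p) for y in space}
    f = [0.0] * n
    for y in space:
        worse = sum(probs[z] for z in space if g(z) < g(y))
        better = sum(probs[z] for z in space if g(z) > g(y))
        weight = 2.0 * probs[y] * (worse - better)
        for i in range(n):
            f[i] += y[i] * weight
    return f

def jacobian(p0, g, h=0.25):
    """Central-difference Jacobian; entry [i][m] approximates d f_i / d p_m."""
    n = len(p0)
    J = [[0.0] * n for _ in range(n)]
    for m in range(n):
        up = list(p0)
        up[m] += h
        down = list(p0)
        down[m] -= h
        f_up, f_down = drift(up, g), drift(down, g)
        for i in range(n):
            J[i][m] = (f_up[i] - f_down[i]) / (2 * h)
    return J

if __name__ == "__main__":
    for corner in [(1, 1, 1), (0, 0, 0), (0, 1, 1)]:
        print(corner, jacobian([float(b) for b in corner], TABLE.get))
```

At the two local maxima every diagonal entry is -2, while at the corner 011, whose neighbor 111 has a higher fitness, the first diagonal entry is +2.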

4.2 Convergence results 

Based on Theorem 2, we can conclude that the cGA will never stay in a configuration
which is not a local maximum of $g$. This still leaves one question unanswered: is
it possible that $p(k)$ does not converge to a local maximum of $g$, for example, if the
algorithm exhibits limit-cycle or chaotic behavior? Regarding this question, we
provide a condition under which the cGA converges to a local maximum of $g$. This
is proven in Theorem 3 below.

Theorem 3. For the initial configuration $p(0) = (0.5, \dots, 0.5)$, the cGA always converges
to a local maximum of $g$.

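Proof. The function $f(\cdot)$ is continuous on $K$; therefore, there is a differentiable function
$F$ where

$$F : \mathbb{R}^n \to \mathbb{R}, \qquad \forall\, 1 \le i \le n \text{ and all } p \in K, \quad \frac{\partial F}{\partial p_i}(p) = f_i(p) \tag{36}$$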

10 Evolutionary Computation Volume x, Number x 



A Step Forward in Studying the Compact Genetic Algorithm 



Now, consider the variation of $F$ along the trajectories of the ODE. By (17) and (36),

$$\frac{dF}{dt} = \sum_{i=1}^{n} \frac{\partial F}{\partial p_i} \frac{dp_i}{dt} = \sum_{i=1}^{n} f_i(p) \frac{dp_i}{dt} = \sum_{i=1}^{n} \Big( \frac{dp_i}{dt} \Big)^2 \ge 0 \tag{37}$$

Thus, $F$ is non-decreasing along the trajectories of the ODE. Also, due to the nature
of the algorithm given by (1), for the initial configuration $(0.5, \dots, 0.5)$, the solution of the
ODE (17) will be confined to $K$, which is a compact subset of $\mathbb{R}^n$. Hence, by LaSalle's
invariance principle (Narendra and Annaswamy, 1989, Theorem 2.7), asymptotically,
the trajectories will be in the set $K_1 = \{p \in K \mid (dF/dt)(p) = 0\}$. By (37),

$$\frac{dF}{dt}(p) = 0 \;\Rightarrow\; f_i(p) = \frac{dp_i}{dt} = 0, \quad 1 \le i \le n \tag{38}$$

Therefore, $p$ is a stationary configuration of the ODE. Thus, the solution should
converge to a stationary configuration. Since by Theorem 2 all stationary configurations
which are not local maxima are unstable, the theorem follows. □

Theorems 2 and 3 together characterize the long-term behavior of the cGA 
when the function g is injective. Theorem 2 states that only local maxima of g are 
asymptotically stable stationary configurations of the algorithm. In addition, Theorem 
3 shows that the cGA cannot converge to any point in K which is not a local maximum. 
If the function g is not an injective function, then Theorem 2 (part 2) may be invalid 
and we cannot make sure that the local maxima of g are the only stable stationary con- 
figurations of the cGA. In this case, the cGA may converge to some non-deterministic 
configurations and stay at them. 



5 Conclusion 

The cGA is an estimation of distribution algorithm. It is very simple and can be
easily implemented in hardware. Because it uses only a small amount of memory, it can have
many applications in memory-constrained problems. In this paper, a mathematical
framework for the cGA, based on weak convergence and non-linear systems
theory, has been proposed and, consequently, the convergence behavior of the cGA
has been studied. We have proven that the local maxima of an injective function are
asymptotically stable stationary points of the cGA and shown that the cGA converges
to one of these local maxima. While the results obtained in this paper are interesting in
their own right, they can also serve as one of the first steps towards using ODEs in the analysis
of EDAs and other evolutionary algorithms.

There are a lot of open questions and we plan to study them in the
future. First of all, we are interested in extending our framework to non-injective
functions and determining the convergence rate of the cGA for different functions.
Recently, some theorems have been developed in stochastic approximation theory
that can be useful in this regard (Kushner and Yin, 2000). Since for each optimization
algorithm chosen to solve a problem, the shape and size of the basins of attraction may
be different, we also would like to compute the basins of attraction of the local maxima
for a given function. Comparing the basins of attraction of the local maxima for
the cGA with those for other algorithms helps us in
choosing a better algorithm for the optimization of a given function. For example, if
we could show that the basin of attraction of the global maximum for the cGA is bigger
than the basin of attraction of the same point for the PBIL, then we can predict that
for different initial values the cGA will converge to the global maximum with a higher
probability.